The conventional definition of a topological metric over a space specifies properties that must be 
obeyed by any measure of “how separated” two points in that space are. Here it is shown how to 
extend that definition, and in particular the triangle inequality, to concern arbitrary numbers of 
points. Such a measure of how separated the points within a collection are can be bootstrapped, 
to measure “how separated” from each other are two (or more) collections. The measure presented 
here also allows fractional membership of an element in a collection. This means it directly concerns 
measures of “how spread out” a probability distribution over a space is. When such a measure is 
bootstrapped to compare two collections, it allows us to measure how separated two probability 
distributions are, or more generally, how separated a distribution of distributions is. 

PACS numbers: 


I. INTRODUCTION 

The conventional definition of a topological metric for- 
malizes the concept of distance. It specifies properties 
required of any function that purports to measure “how 
separated” two elements of a space are. Often, however, 
one wants to measure “how separated” the members of 
a collection of more than two elements are. The conventional 
way to do this is to combine the pair-wise metric 
values for all pairs of elements in the collection into an 
aggregate measure. This is ad hoc, however. 

As an alternative, here the formal definition of a topo- 
logical metric is extended to apply to collections of more 
than two elements. In particular, the triangle inequal- 
ity is extended to concern such collections. The measure 
presented here applies even to collections with duplicate 
elements (i.e., to bags). It also applies to collections with 
“fractional” numbers of elements, i.e., to probability dis- 
tributions. 

This measure can be directly incorporated into many 
domains where ad hoc combinations of pair-wise metrics 
are currently used. In addition, when applied to different 
projections of a high-dimensional data set, it provides a 
novel type of vector-valued characterization of that data 
set. 

This new measure can be bootstrapped in a natural 
way, to measure “how separated” from each other two 
collections are. In other words, given a measure $p$ of 
how separated from each other the elements in an arbitrary 
collection $\xi$ are, one can define a measure of how 
separated from each other two collections $\xi_1$ and $\xi_2$ are. 
(Intuitively, the idea is to subtract the sum of the measure’s 
values for each of the two separate collections $\xi_1$ 
and $\xi_2$ from the value of the measure for the union of 
the collections.) More generally, one can measure how 
separated a collection of such collections is. Indeed, with 
fractional memberships, such bootstrapping allows us to 
measure how separated a distribution of distributions is. 

In the next section the definition of a multi-argument 
metric (multimetric, for short) is presented. Also in 
that section is an extensive set of examples and a list 
of some elementary properties. For instance, it is shown 
that the standard deviation of a probability distribution 
across $\mathbb{R}^N$ is a multimetric, whereas the variance of that 
distribution is not. 

The following section presents a way to bootstrap from 
a multimetric for elements within a collection to a multi- 
metric over collections. Some examples and elementary 
properties of this bootstrapped measure are also in that 
section. 

A short concluding section considers some of the pos- 
sible uses of multimetrics. 


II. MULTIMETRICS 

Collections of elements from a space $X$ are represented 
as vectors of counts, i.e., functions from $x \in X$ to 
$\{0, 1, 2, \ldots\}$. So for example, if $X = \{A, B, C\}$, and we 
have the collection of three $A$'s, no $B$'s, and one $C$, we 
represent that as the vector $(3, 0, 1)$. This use of the 
support of a vector to indicate points is analogous to 
how wave functions are interpreted in quantum mechanics. 
As in quantum mechanics, here it is natural to 
extend the representation; for current purposes we only 
need to extend it to include functions from $x \in X$ to $\mathbb{R}$. In 
particular, doing this allows us to represent probability 
distributions (or density functions, depending on the 
cardinality of $X$) over $X$. Accordingly, our formalization of 
multimetrics will provide a measure of how spread 
out a distribution over distributions is. [11] Given $X$, the 
associated space of all functions from $X$ to $\mathbb{R}$ is written as 
$\mathbb{R}^X$. The subspace of functions that are nowhere-negative 
is written as $(\mathbb{R}^+)^X$. 

As a notational comment, integrals are written with 
the measure implicitly set by the associated space. In 
particular, for a finite space, the point-mass measure is 
implied, and the integral symbol indicates a sum. In addition, 
$\delta_x$ is used to indicate the appropriate type of delta 
function (Dirac, Kronecker, etc.) about $x$. Other shorthand 
is $\mathcal{R}^X \equiv (\mathbb{R}^+)^X - \{0\}$, and $\|v\|$ means $\int dx\, v(x)$. 

In this representation of collections of elements from 
$X$, any conventional metric taking two arguments in $X$ 
can be written as a function $p$ over a subset of the vectors 
in $\mathcal{R}^X$. That subset consists of all vectors that either 
have two of their components equal to 1 and all others 
0, or one component equal to 2 and all others 0. For 
example, for $X = \{A, B, C\}$, the metric distance between 
$A$ and $B$ is $p(1, 1, 0)$, and from $A$ to itself is $p(2, 0, 0)$. 

Generalizing this, a multimetric for $T(X) \subset \mathcal{R}^X$ is 
defined as a real-valued function $p$ over $\mathcal{R}^X$ such that 
$\forall u, v, w \in \mathcal{R}^X$, 

1) $u, v, w \in T(X) \Rightarrow p(u + v) \le p(u + w) + p(v + w)$. 

2) $p(u) \ge 0$, $p(k\delta_x) = 0 \;\forall x, k > 0$. 

3) $p(u) = 0 \Rightarrow u = k\delta_x$ for some $k, x$. 
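
The three conditions can be spot-checked by brute force on a small finite space. The following is a minimal sketch, not part of the original formalism: it assumes $X = \{A, B, C\}$, restricts attention to count vectors with entries in $\{0, 1, 2\}$, and takes $T(X)$ to be that whole set. The function names (`support_size_minus_one`, `obeys_conditions`) are hypothetical. The first candidate is the support-size function of Ex. 5; the second (total count minus 1) is an illustrative non-example.

```python
from itertools import product

def support_size_minus_one(s):
    """Candidate multimetric: number of non-zero components of s, minus 1."""
    return sum(1 for c in s if c > 0) - 1

def add(u, v):
    """Component-wise sum of two count vectors (pooling two collections)."""
    return tuple(a + b for a, b in zip(u, v))

def obeys_conditions(p, vectors):
    """Brute-force test of conditions (1)-(3), taking T(X) to be all of `vectors`."""
    for u in vectors:
        on_axis = sum(1 for c in u if c > 0) == 1   # u = k * delta_x for some k, x
        if p(u) < 0:                                 # condition (2), first part
            return False
        if on_axis and p(u) != 0:                    # condition (2), second part
            return False
        if p(u) == 0 and not on_axis:                # condition (3)
            return False
    for u, v, w in product(vectors, repeat=3):       # condition (1)
        if p(add(u, v)) > p(add(u, w)) + p(add(v, w)):
            return False
    return True

# Every non-zero count vector over X = {A, B, C} with counts in {0, 1, 2}.
vectors = [s for s in product(range(3), repeat=3) if any(s)]

print(obeys_conditions(support_size_minus_one, vectors))
print(obeys_conditions(lambda s: sum(s) - 1, vectors))
```

The first call should report `True` and the second `False`: the total-count candidate assigns a non-zero value to $2\delta_x$, so it fails the second part of condition (2). A finite check of this kind only illustrates the definition; it proves nothing about vectors outside the enumerated set.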

Consider the requirement for a (two-argument) metric 
that it be symmetric under permutation of the points it is 
considering. The analogous requirement that a multimetric 
be symmetric under permutations of the elements 
in the collection is satisfied automatically, simply due 
to how that collection is represented. Next, if only one 
$x \in X$ is in a collection (perhaps occurring more than 
once), then only one component of $u$ is non-zero. 
Accordingly, conditions (2) and (3) are extensions of the 
usual condition defining a metric that it be non-negative 
and equal 0 iff its arguments are the same. Condition (1) 
is an extension of the triangle inequality, to allow repeats 
of elements from $X$ and/or more than two elements from $X$ 
to be in the collection. Note though that condition (1) 
involves sums in its argument rather than (as in a conventional 
norm-based metric for a Euclidean space) differences. 
Intuitively, $T(X)$ is that subset of $\mathcal{R}^X$ over which the 
generalized version of the triangle inequality holds. [12] 

Condition (1) implies that multimetrics obey a second 
triangle inequality, just as conventional metrics do: 

$$p(u + v) \ge |p(u + w) - p(v + w)|.$$

(This follows by rewriting condition (1) as $p(u + w) \ge 
p(u + v) - p(v + w)$, and then relabeling twice.) 

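Example 1: Set $X = \mathbb{R}^N$. Take $T(X)$ to be those 
elements of $\mathcal{R}^X$ whose norm equals 1, i.e., the probability 
density functions over $\mathbb{R}^N$. Then have $p(s)$ for 
any $s \in \mathcal{R}^X$ (whether in $T(X)$ or not) be the standard 
deviation of the distribution $s / \|s\|$, i.e., 

$$p(s) = \sqrt{\int dx\, \frac{s(x)}{\|s\|} \left| x - \int dx'\, x' \frac{s(x')}{\|s\|} \right|^2}.$$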

Conditions (2) and (3) are immediate. To understand 
condition (1), first, as an example, say that all three of 
$u$, $v$ and $w$ are separate single delta functions over $X$. 
Then condition (1) reduces to the conventional triangle 
inequality over $\mathbb{R}^N$, here relating the points (in the 
supports of) $u$, $v$ and $w$. This example also demonstrates 
that the variance (i.e., the square of our $p$) is not a 
multimetric. 
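
The single-delta-function case can be made concrete with three collinear points. The sketch below is illustrative only: the helper names and the specific points 0, 2 and 1 on the real line are assumptions chosen for the example. It evaluates the standard deviation of the normalized sum of two unit point masses (which is half the distance between them) and compares the two sides of condition (1) for both the standard deviation and its square.

```python
import math

def std_multimetric(points, weights):
    """Standard deviation of the normalized distribution sum_k weights[k] * delta(points[k]).

    Points are tuples in R^N; the variance is summed over the N coordinates.
    """
    total = sum(weights)
    dim = len(points[0])
    mean = [sum(w * x[i] for x, w in zip(points, weights)) / total for i in range(dim)]
    var = sum(w * sum((x[i] - mean[i]) ** 2 for i in range(dim))
              for x, w in zip(points, weights)) / total
    return math.sqrt(var)

def p_pair(x, y):
    """p(delta_x + delta_y): two unit point masses at x and y."""
    return std_multimetric([x, y], [1.0, 1.0])

# u, v, w are single delta functions at 0, 2 and 1 on the real line.
x_u, x_v, x_w = (0.0,), (2.0,), (1.0,)

lhs = p_pair(x_u, x_v)
rhs = p_pair(x_u, x_w) + p_pair(x_v, x_w)
print("standard deviation:", lhs, "<=", rhs, "->", lhs <= rhs)

lhs_var = p_pair(x_u, x_v) ** 2
rhs_var = p_pair(x_u, x_w) ** 2 + p_pair(x_v, x_w) ** 2
print("variance:          ", lhs_var, "<=", rhs_var, "->", lhs_var <= rhs_var)
```

With the intermediate point at the midpoint, the standard deviation meets condition (1) with equality, while the variance gives 1.0 on the left and 0.5 on the right, so it violates the inequality.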


For a vector s that involves multiple delta functions, 
p(s) measures the square root of the sum of the squares of 
the Euclidean distances between the points (in the sup- 
port of) s. In this sense it tells us how “spread out” those 
points are. Condition (1) holds even for vectors that are 
not sums of delta functions, however (see appendix). 

Example 2: As a variant of Ex. 1, have $X$ be the unit 
simplex in $\mathbb{R}^N$, and use the same $p$ as in Ex. 1. In this 
case any element of $X$ is a probability distribution over a 
variable with $N$ possible values. So any element of $T(X)$ 
is a probability density function over such probability 
distributions. In particular, say s is a sum of some delta 
functions for such an X. Then p(s) measures how spread 
out the probability distributions in (the support of) s are. 
If those probability distributions are themselves sums of 
delta functions, they just constitute subsets of our N val- 
ues, and p(s) measures how spread out from one another 
those subsets are. 

Example 3: As another variant of Ex. 1, for any $X$, 
take $T(X) = \mathcal{R}^X$. Define the tensor contraction 
$\langle s \mid t \rangle \equiv \int dx\, dx'\, s(x)\, t(x')\, F(x, x')$, where $F$ is symmetric and 
nowhere-negative, and where $F(x, x') = 0 \Leftrightarrow x = x'$. 
Then $p(s) \equiv \sqrt{\langle s \mid s \rangle}$ obeys conditions (2) and (3) by 
inspection. It also obeys condition (1) (see appendix). 

Note that the $\langle \cdot \mid \cdot \rangle$ operator is not an inner product 
over the extension of $T(X)$ to a full vector space. 
When components of $s$ can be negative, $\langle s \mid s \rangle$ may be as 
well. Note also that there is a natural differential geometric 
interpretation of this $p$ when $X$ consists of $N$ values. 
Say we have a curve on an $N$-dimensional manifold with 
metric tensor $F$ at a particular point on the curve, and 
that at that point the tangent vector to the curve is $s$. 
Then $p(s)$ is the derivative of arc length along that curve, 
evaluated at that point. 

This suggests an extension of this multimetric, in which 
rather than a tensor contraction between two vectors, we 
form the tensor contraction of $n$ vectors: 
$\langle s^1, \ldots, s^n \rangle \equiv \int dx_1 \ldots dx_n\, s^1(x_1) \ldots s^n(x_n)\, F(x_1, \ldots, x_n)$, 
where $F$ is invariant under permutation of its arguments, 
nowhere-negative, and equals 0 if and only if all its arguments 
have the same value. Any $p(s)$ that is a monotonically 
increasing function of $\langle s, s, \ldots, s \rangle^{1/n}$ automatically obeys 
conditions (2) and (3). 

It is worth collecting a few elementary results concern- 
ing multimetrics: 

Lemma 1: 

1. Let $\{p_i\}$ be a set of functions that obey conditions 
(2) and (3), and $\{a_i\}$ a set of non-negative real 
numbers at least one of which is non-zero. Then 
$\sum_i a_i p_i$ also obeys conditions (2) and (3). 

2. Let $\{p_i\}$ be a set of functions that obey condition 
(1), and $\{a_i\}$ a set of non-negative real numbers at 
least one of which is non-zero. Then $\sum_i a_i p_i$ also 
obeys condition (1). 

3. Let $f : \mathbb{R} \rightarrow \mathbb{R}^+$ be a monotonically increasing 
concave function that equals 0 when its argument 
does. Then if $p$ is a multimetric for some $T(X)$, 
$f(p)$ is also a multimetric for $T(X)$ (see appendix). 

4. Let $f : X \rightarrow Y$ be invertible, and let $p_Y$ be a 
multimetric over $Y$. Define the operator $B_f : \mathcal{R}^X \rightarrow \mathcal{R}^Y$ 
by $[B_f(s)](y) = s(f^{-1}(y))$ if $f^{-1}(y)$ exists, 0 
otherwise. $B_f$ is a linear operator. This means 
$p_X(s) \equiv p_Y(B_f(s))$ is a multimetric [13]. 

Example 4: Take $X = \mathbb{R}^N$ again, and let $T(X)$ be all of 
$\mathcal{R}^X$ with bounded support. Then by Lemma 1, the width 
along $x_1$ of (the support of) $s \in T(X)$ is a multimetric 
function of $s$ (see appendix). 

This means that the average of the width in $x_1$ over all 
possible rotations of X is also a multimetric. Similarly, 
consider the smallest axis-parallel box enclosing the (sup- 
port of the) Euclidean points in s. Then the sum of the 
lengths of the edges of that box is a multimetric function 
of s. 

On the other hand, while the volume of that box obeys 
conditions (2) and (3), in general it can violate condition 
(1). Similarly, the volume of the convex hull of the (sup- 
port of) the points in s obeys conditions (2) and (3) but 
can violate (1). (In general, multimetrics have the di- 
mension of a length, so volumes have to be raised to the 
appropriate power to make them be multimetrics.) 

It is worth comparing the sum-of-edge-lengths multimetric 
to the standard deviation multimetric of Ex. 1 for 
the case where all arguments $s$ are finite sums of delta 
functions (i.e., “consist of a finite number of points”). For 
such an $s$ we can write the sum-of-edge-lengths multimetric 
as a sum over all $N$ dimensions $i$ of $\max_j s^j_i - \min_j s^j_i$, 
where $s^j$ is the $j$'th point in $s$. In contrast, the (square 
of the) standard deviation multimetric is also a sum over 
all $i$, but of the (square of the) standard deviation of the 
$i$'th components of the points in $s$. Another difference is 
that the standard deviation multimetric is a continuous 
function of its argument, unlike the sum-of-edge-lengths 
multimetric. 
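
For finite point sets, the sum-of-edge-lengths function is simple to compute, and condition (1) can be spot-checked on random inputs. The sketch below makes several assumptions that are not in the text: collections are represented as Python lists of integer points (integers avoid floating-point rounding in the comparison), adding two collections is represented by list concatenation, and the function names and parameter defaults are hypothetical.

```python
import random

def edge_length_sum(points):
    """Sum of the edge lengths of the smallest axis-parallel box enclosing `points`."""
    dim = len(points[0])
    return sum(max(x[i] for x in points) - min(x[i] for x in points)
               for i in range(dim))

def random_collection(rng, dim, max_points=4, lo=-10, hi=10):
    """A non-empty list of integer points, standing in for a sum of delta functions."""
    return [tuple(rng.randint(lo, hi) for _ in range(dim))
            for _ in range(rng.randint(1, max_points))]

def count_violations(trials=2000, dim=3, seed=0):
    """Count random triples (u, v, w) that violate condition (1)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        u, v, w = (random_collection(rng, dim) for _ in range(3))
        # u + v pools the points of the two collections.
        if edge_length_sum(u + v) > edge_length_sum(u + w) + edge_length_sum(v + w):
            violations += 1
    return violations

print(count_violations())
```

No violations are expected: along each axis the intervals spanned by $u + w$ and by $v + w$ both contain the points of $w$, so together they cover the interval spanned by $u + v$.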

Example 5: Let $X$ be countable and have $T(X) = \mathcal{R}^X$. 
Then $p(s) = \int dx\, \Theta(s(x)) - 1$, where $\Theta$ is the Heaviside 
function, is a multimetric (see appendix). This is the 
volume of the support of $s$, minus 1. 

Example 6: Let $X$ be countable and have $T(X) = \mathcal{R}^X$. 
Then $p(s) = \|s\| - \max_x s(x)$ obeys conditions (2) and 
(3), by inspection. Canceling terms, for this $p$ condition 
(1) holds iff $\max_x (u(x) + v(x)) \ge \max_x (u(x) + w(x)) + 
\max_x (v(x) + w(x)) - 2\|w\|$. This is not true in general, 
for example when $\|w\| = 0$ and the supports of $u$ and 
$v$ are disjoint. However if we take $T(X)$ to be the unit 
simplex in $\mathcal{R}^X$, then condition (1) is obeyed, and $p$ is a 
multimetric. 

Example 7: Let $X$ have a finite number of elements and 
set $T(X) = \mathcal{R}^X$. Say that $p(s) = 0$ along all of the axes, 
and that everywhere else, $k \le p(s) \le 2k$ for some fixed 
$k > 0$. Then $p$ is a multimetric. 

A. Vector-valued multimetrics 

It is straightforward to extend the definition of a multimetric 
to have range $\mathbb{R}^M$ rather than $\mathbb{R}$, so long as one 
has a linear ordering over $\mathbb{R}^M$ to specify the appropriate 
extension of condition (1). For example, consider 
the component-wise ordering: $\forall a, b$, $a \le b \Leftrightarrow a_i \le b_i \;\forall i \in 
\{1, 2, \ldots, M\}$. Say we have a set of $M$ scalar multimetrics. 
Then the $M$-fold Cartesian product of those multimetrics 
is an $M$-dimensional multimetric, when component-wise 
ordering defines the inequality in condition (1). 

More generally, say we have chosen such a linear ordering 
over $\mathbb{R}^M$, and have an $M$-dimensional function with 
domain $\mathcal{R}^X$. Say this function obeys conditions (1) and 
(2) of an $M$-dimensional multimetric for our linear ordering. 
Then this function can be used as a low-dimensional 
characterization of an element of $\mathcal{R}^X$. In general, such 
characterizations may have $M$ less than $|\mathbb{R}^X|$, the 
dimension of the space in which $\mathcal{R}^X$ is embedded, and may 
violate condition (3). The following example illustrates 
this: 

Example 8: Consider again Ex. 1. To define our vector-valued 
multimetric for the $X$ of Ex. 1, say we have a 
scalar multimetric $p_{\mathbb{R}}$ for the subspace, $X' = T(X') = \mathbb{R}$. 
Let $\{v^1, v^2, \ldots, v^M\}$ be a set of $M$ unit norm vectors living 
in $X$. Then we can define our $M$-dimensional multimetric 
by 

$$p_i(u) \equiv p_{\mathbb{R}}[f_{u, v^i}(t)];$$

$$f_{u, v^i}(t) \equiv \int dx\, u(x)\, \delta(v^i \cdot x - t).$$

To illustrate this, take $M = N$ and have the $\{v^i\}$ be 
the unit normals along the $N$ axes of $X$. Let $u$ be a sum 
of delta functions; $u = \delta_{x^1} + \delta_{x^2}$. Let $p_{\mathbb{R}}$ be the standard 
deviation multimetric of Ex. 1 for one-dimensional probability 
density functions. So each component $p_i(u)$ is just 
the $i$'th component of the difference $x^1 - x^2$ (up to sign 
and a factor of 1/2). Accordingly, 
$u$ can be reconstructed from the vector $p_i(u)$. 

In this illustration conditions (1) and (2) are immediate. 
If $M = N$, then condition (3) also holds for $u$'s 
like the one considered here that are sums of two delta 
functions, but not more generally. Now modify this 
illustration by having $M < N$ and the $\{v^i\}$ not all point 
along the axes of $X$. Then for general $u$, the components 
$p_i(u)$ are the projections of $u$ along the different vectors 
$\{v^i\}$. As in techniques like Principal Components 
Analysis [2], those projections provide a low-dimensional 
characterization of $u$. 
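
The projection construction can be sketched for the case where $u$ is a sum of unit delta functions. In the sketch below the function names, the two sample points and the choice of axis-aligned directions are illustrative assumptions. It uses the standard deviation as the one-dimensional multimetric: each component is the standard deviation of the points' projections onto one unit vector.

```python
import math

def projected_std(points, direction):
    """Standard deviation of the projections of `points` onto `direction`.

    `points` stands for u = sum of unit delta functions at those points.
    """
    t = [sum(d * c for d, c in zip(direction, x)) for x in points]
    mean = sum(t) / len(t)
    return math.sqrt(sum((value - mean) ** 2 for value in t) / len(t))

def vector_multimetric(points, directions):
    """One scalar value per direction, giving an M-dimensional characterization."""
    return [projected_std(points, v) for v in directions]

# Two points in R^3 and the three coordinate axes (M = N = 3).
x1, x2 = (1.0, 4.0, 0.0), (3.0, 1.0, 0.0)
axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

print(vector_multimetric([x1, x2], axes))
```

For two unit point masses the standard deviation along an axis is half the absolute coordinate difference, so this prints `[1.0, 1.5, 0.0]`. Passing fewer directions than dimensions, or directions that are not axis-aligned, gives the lower-dimensional characterization described in the text.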


III. CONCAVITY GAPS AND DISPERSIONS 

In Ex. 1, $p$ can be used to tell us how spread out 
a distribution over $\mathbb{R}^N$ is. One would like to be able 
to use that $p$ to construct a measure of how spread out 
a collection of multiple distributions over $\mathbb{R}^N$ is. 
Intuitively, we want a way to construct a metric for a space 
of sets (generalized to be able to work with sets with 
duplicates, fractional memberships, etc.) from a metric for 
the members within a single set. This would allow us 
to directly incorporate the distance relation governing $X$ 
into the distance relation for $\mathcal{R}^X$. 

To do this, first let $\{Y, S(Y)\}$ be any pair of a subset 
of a vector space together with a subset of $\mathbb{R}^Y$ such that 
$\forall g \in S(Y)$, $\frac{\int dy\, g(y)\, y}{\int dy\, g(y)} \in Y$. (As an example, we could take 
$Y$ to be any convex subspace of a vector space, with $S(Y)$ 
any subset of $\mathcal{R}^Y$.) Then the associated concavity gap 
operator $C : S(Y) \rightarrow \mathbb{R}^{S(Y)}$ is defined by 

$$(C\sigma)(g) \equiv \sigma\!\left( \frac{\int dy\, g(y)\, y}{\int dy\, g(y)} \right) - \frac{\int dy\, g(y)\, \sigma(y)}{\int dy\, g(y)},$$

where $y \in Y$, and both $\sigma$ and $g$ are arbitrary elements 
in $S(Y)$. So the concavity gap operator takes any single 
member of the space $S(Y)$ (namely $\sigma$) and uses it to 
generate a function (namely, $C\sigma$) over all of $S(Y)$. [14] 

In particular, say $Y = T(X)$ for some space $X$. Say 
we are given a multimetric $\sigma$ measuring the ($X$-space) 
spread specified by any element of $Y$. Say we are also 
given a $g$ which is a normalized distribution over $Y$. Then 
$C\sigma(g)$ is a measure of how spread out the distribution $g$ is. 
Note that in this example the space $S(Y)$ is both the space 
of multimetrics over $Y$ and the space of distributions for 
$Y$, exemplified by $\sigma$ and $g$, respectively. 

We can rewrite the definition of the concavity gap in 
several ways: 

$$C\sigma(g) = \sigma(E_g(y)) - E_g(\sigma) = \sigma\!\left( \frac{\mathbf{y} \cdot g}{\|g\|} \right) - \frac{\sigma \cdot g}{\|g\|},$$

where $E_g$ means expected value evaluated according to 
the probability distribution $g / \|g\|$, and in the last expression 
$\mathbf{y}$ is the (infinite-dimensional) matrix whose $y$'th column 
is just the vector $y$, and the inner products are over 
the vector space $S(Y)$. Taken together, these equations 
say that the concavity gap of $\sigma$, applied to the distribution 
$g$, is given by evaluating the function $\sigma$ at the center 
of mass of the distribution $g$, and then subtracting the 
inner product between $\sigma$ and $g$. 


Example 9: Let $Y = \mathbb{R}^N$, and choose $S(Y)$ to be 
the set of nowhere-negative functions of $Y$ with non-zero 
magnitude. Choose $\sigma(y) = 1 - \sum_{i=1}^N y_i^2$. Then 
$C\sigma(g) = \mathrm{Var}(g / \|g\|)$. 
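
The identity in Example 9 can be checked numerically when $g$ is a finite weighted sum of delta functions. The following sketch assumes that discrete form for $g$; the function names, the sample points and the weights are illustrative choices. It computes the concavity gap as "the function at the center of mass, minus the average of the function", and compares it with the variance of the normalized distribution.

```python
import math

def concavity_gap(sigma, points, weights):
    """C sigma (g) for g = sum_k weights[k] * delta(points[k])."""
    total = sum(weights)
    dim = len(points[0])
    center = tuple(sum(w * y[i] for y, w in zip(points, weights)) / total
                   for i in range(dim))
    mean_sigma = sum(w * sigma(y) for y, w in zip(points, weights)) / total
    return sigma(center) - mean_sigma

def sigma_quadratic(y):
    """sigma(y) = 1 - sum_i y_i^2."""
    return 1.0 - sum(c * c for c in y)

def variance(points, weights):
    """Variance of the normalized distribution, summed over coordinates."""
    total = sum(weights)
    dim = len(points[0])
    mean = [sum(w * y[i] for y, w in zip(points, weights)) / total for i in range(dim)]
    return sum(w * sum((y[i] - mean[i]) ** 2 for i in range(dim))
               for y, w in zip(points, weights)) / total

points = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
weights = [1.0, 2.0, 1.0]

gap = concavity_gap(sigma_quadratic, points, weights)
var = variance(points, weights)
print(gap, var, math.isclose(gap, var))
```

The two numbers agree because the constant and the linear parts of any $\sigma$ cancel in the gap, leaving $E(|y|^2) - |E(y)|^2$. Rescaling all the weights by a common positive factor leaves the gap unchanged, since $g$ enters only through $g / \|g\|$.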

Example 10: Say $X$ has $N$ values, with $T(X) = \mathcal{R}^X$. 
Consider a $u \in T(X)$ whose components are all either 0 
or some particular constant $a$ such that $\int dx\, u(x) = 1$. 
So $u$ is a point on the unit hypercube in $T(X)$, projected 
down to the unit simplex. Let $T$ be the set of all such 
points $u$. In the usual way, the support of each element 
of $T$ specifies a set of elements of $X$. 

Let $Y = T(X)$, and have $S(Y) = \mathcal{R}^Y$. Have $g$ be 
a uniform average of a countable set of delta functions, 
each of which is centered on a member of T. So each of 
the delta functions making up g specifies a set of elements 
of X; g is a specification of a collection of such X-sets. 

In this scenario $\sigma(E_g(y))$ is $\sigma$ applied to the union (over 
$X$) of all the $X$-sets specified in $g$. In contrast, $E_g(\sigma)$ is 
the average value you get when you apply $\sigma$ to one of 
the $X$-sets specified in $g$. $C\sigma(g)$ is the difference between 
these two values. Intuitively, it reflects how much overlap 
there is among the $X$-sets specified in $g$. 

Example 11: Say $X$ has $N$ values, with $T(X) = \mathcal{R}^X$. 
Have $Y = T(X)$, and $S(Y) = \mathcal{R}^Y$, i.e., the set of all 
nowhere-negative non-zero functions over those points 
in $\mathbb{R}^N$ with no negative components. Choose $\sigma(y) = 
H(y) \;\forall y \in Y$, where $H(y) \equiv -\int dx\, y(x) \ln[y(x)]$, the 
Shannon entropy function extended to non-normalized 
$y$. This $\sigma$ is a natural choice to measure how “spread 
out” any point in $Y$ with magnitude 1 is. 

Have $g$ be a sum of a set of delta functions, about the 
distributions over B, $\{v^1, v^2, \ldots\}$. Then $C\sigma(g)$ is a measure 
of how “spread out” those distributions are. In the 
special case where $g = \delta_{v^1} + \delta_{v^2}$, $C\sigma(g)$ is the Jensen-Shannon 
divergence between $v^1$ and $v^2$ [6, 8]. More generally, 
if $g$ is a probability density function across the 
space of all distributions over B, $C\sigma(g)$ is a measure of 
how “spread out” that density function is. 
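
The two-delta special case can be checked directly. The sketch below is an illustration under stated assumptions: the two example distributions are arbitrary, the function names are hypothetical, and the reference value uses the common definition of the Jensen-Shannon divergence as the average Kullback-Leibler divergence of each distribution from their midpoint (natural logarithms). It evaluates the concavity gap of the Shannon entropy for $g = \delta_{v^1} + \delta_{v^2}$ and compares it with that reference.

```python
import math

def entropy(y):
    """Shannon entropy -sum_x y(x) ln y(x), skipping zero entries."""
    return -sum(c * math.log(c) for c in y if c > 0)

def concavity_gap(sigma, points, weights):
    """C sigma (g) for g = sum_k weights[k] * delta(points[k])."""
    total = sum(weights)
    dim = len(points[0])
    center = tuple(sum(w * y[i] for y, w in zip(points, weights)) / total
                   for i in range(dim))
    mean_sigma = sum(w * sigma(y) for y, w in zip(points, weights)) / total
    return sigma(center) - mean_sigma

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_x p(x) ln(p(x) / q(x))."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(v1, v2):
    """Average KL divergence of v1 and v2 from their midpoint."""
    mid = [(a + b) / 2 for a, b in zip(v1, v2)]
    return 0.5 * kl_divergence(v1, mid) + 0.5 * kl_divergence(v2, mid)

v1 = (0.5, 0.5, 0.0)
v2 = (0.1, 0.2, 0.7)

gap = concavity_gap(entropy, [v1, v2], [1.0, 1.0])
print(gap, jensen_shannon(v1, v2))
```

The two printed values should agree to rounding error, since the entropy of the midpoint minus the average entropy is an equivalent way of writing the same divergence.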

There are several elementary properties of concavity 
gaps worth mentioning: 

Lemma 2: 

1. $C$ is linear. 

2. $C\sigma$ is linear $\Leftrightarrow$ it equals 0 everywhere $\Leftrightarrow$ $\sigma$ is linear. 

3. $C\sigma$ is continuous $\Leftrightarrow$ $\sigma$ is continuous. 

4. $C\sigma(g) = 0$ if $g \propto \delta_{y'}$ for some $y' \in Y$. 

5. Giving $C\sigma$ and the values of $\sigma$ at $1 + |Y|$ distinct 
points in $Y$ fixes the value of $\sigma$ across all $Y$. ($|Y|$ 
is the dimension of $Y$.) 

6. The equivalence class of all $\sigma'$ having a particular 
concavity gap $C\sigma$ is the set of functions of $y \in Y$ 
having the form $\{\sigma(y) + b \cdot y + a : a \in \mathbb{R}, b \in 
Y, \sigma(y) + b \cdot y + a \in S(Y)\}$. 


Proof: (2.1) and (2.4) are immediate. The first iff in 
(2.2) follows from the fact that $C\sigma(g) = C\sigma(ag) \;\forall a \in \mathbb{R}$. 
To see the forward direction of the second iff, take 
$g = \delta_y / 2$ and $h = \delta_{y'} / 2$ and expand $C\sigma(g + h) = 
\sigma\!\left(\frac{y + y'}{2}\right) - \frac{\sigma(y) + \sigma(y')}{2}$. To see the forward direction of 
(2.3), choose $g$ to have its center of mass infinitesimally 
to one side of the discontinuity in $\sigma$, and then move it 
infinitesimally to the other side to get a discontinuity in 
the associated values of $(C\sigma)(g)$. 

To prove (2.5), consider the case where $Y$ is one-dimensional 
for simplicity. Say I give you $\sigma$ at $A$ and 
at $B > A$, and also give you $C\sigma$. Then for every $C > B$, 
choose $g = \frac{C - B}{C - A}\delta_A + \frac{B - A}{C - A}\delta_C$. Evaluating the associated 
value $C\sigma(g) = \sigma(B) - \frac{C - B}{C - A}\sigma(A) - \frac{B - A}{C - A}\sigma(C)$ allows us to 
solve for $\sigma(C)$. Similar reasoning holds for $C < A$ and 
$C \in (A, B)$. For higher dimensions we need the value of 
$\sigma$ at one extra point for each extra dimension of $Y$. This 
completes the proof. 

To prove (2.6), first note that all members of that set do 
indeed have the same concavity gap, $C\sigma$. To complete 
the proof we must show that there are no other $\sigma'$ with 
that concavity gap. Let $\sigma'$ be any element of $S(Y)$ with 
the same concavity gap as $\sigma$. By (2.5), if we know the 
value of $\sigma'$ at a total of $1 + |Y|$ points in $Y$, then we know 
$\sigma'$ in toto. In turn, for any such set of $1 + |Y|$ values, we 
can always find an $a$ and $b$ such that $a + b \cdot y + \sigma(y)$ 
lies in $S(Y)$ and has those values. This means that $\sigma'$ is 
identical to that $a + b \cdot y + \sigma(y)$. QED. 

By (2.4), $C\sigma$ necessarily obeys the second part of condition 
(2) if $S(Y) = \mathcal{R}^Y$. 

Next define a (strict) dispersion over a space $X$ as 
a (strictly) concave real-valued function over $\mathcal{R}^X$ that 
obeys conditions (2) and (3) of a multimetric $\forall u, v, w \in 
\mathcal{R}^X$. 

Example 12: Take $X = \{1, 2\}$, with $T(X) = \mathcal{R}^X = 
(\mathbb{R}^+)^2 - \{0\}$. Define $\sigma(u \in \mathbb{R}^2)$ to equal 0 if $u_1 = 0$ or 
$u_2 = 0$, and equal $\ln(1 + u_1) + \ln(1 + u_2)$ otherwise. Then 
$\sigma$ is a (not everywhere continuous) strict dispersion. 

Example 13: The $X$, $\mathcal{R}^X$, and $p$ of Ex. 3 form a strict 
dispersion (see appendix). 

Example 14: The $X$, $\mathcal{R}^X$, and $p$ of Ex. 5 form a 
dispersion. 

Example 15: The $X$, $\mathcal{R}^X$, and $\sigma$ of Ex. 11 form a strict 
dispersion. 

There are several relations between concavity gaps and 
dispersions: 

Lemma 3: Let $T(X) = \mathcal{R}^X$. 

1. $\sigma$ is a dispersion over $T(X)$ $\Rightarrow$ $\sigma$ is 
nowhere-decreasing over $T(X)$. 

2. $\sigma$ is a dispersion over $T(X)$ and $\sigma(s)$ is independent 
of $\|s\|$ $\forall s \ne 0 \in T(X)$ $\Rightarrow$ $\sigma$ is constant over the 
interior of $T(X)$. 

3. $\sigma$ is (strictly) concave over $T(X)$ $\Rightarrow$ $C\sigma$ obeys 
condition (2) in full (and condition (3)) over $T(X)$. 

4. Say that $\sigma$ is continuous over $T(X)$. Then $C\sigma$ is 
separately (strictly) concave over each simplex in 
$T(X)$ $\Leftrightarrow$ $\sigma$ is (strictly) concave over $T(X)$. 

Proof: (3.1) arises from the fact that a dispersion is 
both concave and nowhere negative. To establish (3.2), 
first consider any two vectors $u, v$ in the interior of $T(X)$ 
that differ in only one component, $i$. Since no component 
of $u$ equals 0, there must be an $s \in T(X)$ such 
that $\frac{u}{\|u\|} = \frac{v + s}{\|v + s\|}$. (If $u_i > v_i$, $s = (u_i - v_i)\delta_i$. 
Otherwise $s_i = 0$, and $s_j = u_j \left[ \frac{v_i}{u_i} - 1 \right] \;\forall j \ne i$.) By (3.1), 
this means that if $\sigma$ is independent of the magnitude of 
its argument, $\sigma(v) \le \sigma(u)$. Since the reverse argument 
must also hold, we have $\sigma(u) = \sigma(v)$. Now repeat this 
reasoning to equate $\sigma(v)$ with $\sigma(w)$ for some $w$ that 
differs from $v$ in only one component, but differs from $u$ in 
two components. Continuing in this way, we equate $\sigma(u)$ 
with $\sigma(z)$ for any $z$ that differs in an arbitrary number 
of components from $u$. 

(3.3) is immediate from the definition of concavity 
and Jensen’s inequality. To derive (3.4), first expand 

$$C\sigma\!\left(\tfrac{a + b}{2}\right) - \frac{C\sigma(a) + C\sigma(b)}{2} = \sigma\!\left(\frac{E_a(y) + E_b(y)}{2}\right) - \frac{\sigma(E_a(y)) + \sigma(E_b(y))}{2}$$

when $\|a\| = \|b\|$ ($y$ being a generic argument 
of $T(X)$). In other words, this equality holds 
when $a$ and $b$ are on the same (not necessarily unit) simplex. 
Next invoke (2.3) to allow us to apply Jensen’s 
inequality. QED. 

Let $f : \mathbb{R} \rightarrow \mathbb{R}$ be monotonically increasing and strictly 
concave. Then by Lemma 3.3, if $\sigma$ is strictly concave, 
$f(C\sigma)$ obeys conditions (2) and (3). For example, this is 
the case for $\sqrt{C\sigma}$. In other words, so long as $\sigma$ is a strict 
dispersion, $\sqrt{C\sigma}$ obeys those conditions. 

On the other hand, Lemma 3.2 means that any non-trivial 
$\sigma$ that normalizes its argument (so that it is a 
probability distribution) and then evaluates a function 
of that normalized argument cannot be a dispersion. So 
for example, if a concavity gap is a dispersion, it must be 
constant. 

Fortunately it is not the case that if $f(C\sigma)$ is a multimetric 
it must be constant. In particular, often for a 
strictly concave $\sigma$, $\sqrt{C\sigma}$ for space $\{Y, S(Y)\}$ is a 
multimetric for an appropriate $T(Y) \subset S(Y)$. 

Example 16: Choose $\{\sigma, Y, S(Y)\}$ as in Ex. 11, and 
take $T(Y)$ to be all elements of $S(Y)$ which are sums 
of two delta functions. This $\sigma$ is strictly concave, so we 
know conditions (2) and (3) are obeyed by $\sqrt{C\sigma}$. 
Furthermore, for this choice of $T(Y)$, obeying condition (1) 
reduces to obeying the conventional triangle inequality of 
two-argument metrics, and it is known that the square 
root of the Jensen-Shannon divergence obeys that 
inequality [3, 8]. Therefore all three conditions are met. 
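
The cited triangle inequality for the square root of the Jensen-Shannon divergence can be spot-checked on random distributions. The sketch below is illustrative only: the sampling scheme (normalizing uniform random numbers), the number of trials, the seed and all function names are assumptions, and a random search of this kind cannot replace the proofs in the cited references.

```python
import math
import random

def entropy(y):
    """Shannon entropy -sum_x y(x) ln y(x), skipping zero entries."""
    return -sum(c * math.log(c) for c in y if c > 0)

def root_js(a, b):
    """Square root of the Jensen-Shannon divergence, via the entropy gap."""
    mid = [(x + y) / 2 for x, y in zip(a, b)]
    divergence = entropy(mid) - (entropy(a) + entropy(b)) / 2
    return math.sqrt(max(0.0, divergence))  # clamp tiny negative rounding errors

def random_distribution(rng, n):
    """A random point on the unit simplex with n components."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def worst_slack(trials=5000, n=4, seed=1):
    """Smallest value of d(a,c) + d(b,c) - d(a,b) seen over random triples."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        a, b, c = (random_distribution(rng, n) for _ in range(3))
        worst = min(worst, root_js(a, c) + root_js(b, c) - root_js(a, b))
    return worst

print(worst_slack())
```

A non-negative result (up to rounding) is consistent with the triangle inequality holding for every sampled triple.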
Example 17: Choose $\{\sigma, Y, S(Y)\}$ as in Ex. 9. As in 
Ex. 16, this $\sigma$ is strictly concave, and therefore $\sqrt{C\sigma}$ 
automatically obeys conditions (2) and (3). Now take 
$T(Y) = S(Y)$. Write $C\sigma(g)$ as $\langle g \mid g \rangle$ for the tensor 
contraction of Ex. 3, where $F(y, y') = \frac{(y - y') \cdot (y - y')}{2}$. So by 
that example, we know that $\sqrt{C\sigma}$ is a multimetric. 


IV. POTENTIAL USES OF MULTIMETRICS 

In addition to their intrinsic mathematical interest, 
multimetrics have numerous potential applications. One 
of them is to allow more nuanced complexity measures 
for physical systems, as described in [9]. Following is a 
list of some others from machine learning [2]: 

1. Mixture of Gaussians density estimation: In density 
estimation one is given a data set of vectors 
$\{x^i\}$ that were generated by IID sampling an unknown 
distribution over $X$ and wants to infer that 
distribution. Say we have a probability distribution 
across $X$ given by a linear combination of $n$ Gaussian 
distributions over $X$, centered at the $n$ points 
$\mu^j$. Such a distribution induces a probability of the 
set $\{x^i\}$. Accordingly, one way to estimate the 
distribution that generated $\{x^i\}$ is to search for the 
linear combination of $n$ Gaussians that maximizes 
the associated probability of $\{x^i\}$. 

One shortcoming of this procedure is that even 
though there are a total of $n$ points used to parameterize 
the distribution, that distribution is based 
solely on metric values for pairs of points (namely 
the distances between $x$ and each of the $\mu^j$). If 
we have a multimetric $p$ though, we have several 
ways to avoid this. For example, we could 
model the probability of each $x^i$ as a Gaussian of 
$p(\delta_{x^i} + \sum_j \delta_{\mu^j})$. We would then take the probability 
of $\{x^i\}$ to be the product of the probabilities of 
the $x^i$, as in conventional Gaussian mixtures modeling. 
We could even model the probability of $\{x^i\}$ 
given the $n$ points $\mu^j$ as a single Gaussian, with 
argument $p(\sum_i \delta_{x^i} + \sum_j \delta_{\mu^j})$. 

2. Kernel density estimation: In kernel density esti- 
mation, one does not estimate the distribution over 
x as a linear combination of n kernel functions (e.g., 
Gaussians) that are free to be centered anywhere 
in X , and then search for which such linear com- 
bination maximizes the probability of one’s data. 
Instead one centers a kernel function on each of the 
data points, and searches for the optimal parameters 
of those kernel functions. Conventionally such 
kernel functions only take two arguments. However 
exactly as in application 1, if one has a multimetric 
over X, one can use kernel functions whose argu- 
ment involves more than two points at once. 


3. Classification can always be done via density esti- 
mation and Bayes’ theorem. So with applications 
1 and 2, we have new ways of doing classification. 

4. Kernel machines are a recent advance in machine 
learning in which data is first mapped non-linearly 
into a feature space where standard algorithms 
(like linear regression, linear discriminant analysis, 
PCA, etc.) are applied [1]. Because of the non-linear 
mapping such methods work even when relationships 
in the data are highly non-linear. All that is 
required for such methods is a positive definite kernel 
function, $k(x, x')$, giving inner products in the 
feature space. Multimetrics are not positive definite 
functions but can easily be made so by taking 
$k(x, x') \equiv \exp[-p(\delta_x + \delta_{x'})]$ as the kernel. So any 
of the multimetrics discussed above can be used for 
statistical analysis with kernel-based learning algorithms. 
This is the case for either supervised or 
unsupervised learning with kernel 
machines. In particular, we can use exponentials of 
multimetrics for regression by using them instead 
of the conventional kernels of kernel machines, with 
the multiplicative coefficients of each kernel (in the 
linear combination that gives our fit to the data) set 
to minimize some appropriate quadratic objective 
function. 
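
As a rough illustration of the kernel construction in item 4, the sketch below builds the matrix of kernel values $\exp[-p(\delta_x + \delta_{x'})]$ for a small data set. It assumes the standard deviation multimetric of Ex. 1, for which $p(\delta_x + \delta_{x'})$ is half the Euclidean distance between $x$ and $x'$; the function names and the sample data are hypothetical. The code only constructs the matrix. It does not verify positive definiteness, and it does not implement any particular kernel machine.

```python
import math

def pair_multimetric(x, y):
    """p(delta_x + delta_y) for the standard deviation multimetric:
    half the Euclidean distance between x and y."""
    return 0.5 * math.dist(x, y)

def kernel(x, y, p=pair_multimetric):
    """k(x, x') = exp(-p(delta_x + delta_x'))."""
    return math.exp(-p(x, y))

def gram_matrix(data, p=pair_multimetric):
    """Matrix of kernel values for every pair of data points."""
    return [[kernel(a, b, p) for b in data] for a in data]

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
for row in gram_matrix(data):
    print(["%.3f" % value for value in row])
```

Because $p$ vanishes on $2\delta_x$ and is symmetric in its two points, the matrix has ones on the diagonal and is symmetric. Any other two-point restriction of a multimetric could be passed in through the `p` argument.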


V. APPENDIX 
A. Proof of Lemma 1.3 

That $f(p)$ obeys conditions (2) and (3) when $p$ does 
is immediate. To prove that condition (1) is obeyed, 
consider any $u, v, w \in T(X)$, for which $p(u + v) \le p(u + 
w) + p(v + w)$. First assume that $p(u + v) \le \max[p(u + 
w), p(v + w)]$. Then since $f$ is increasing, $f(p(u + v)) \le 
\max[f(p(u + w)), f(p(v + w))]$. Since in turn $\max[f(p(u + 
w)), f(p(v + w))] \le f(p(u + w)) + f(p(v + w))$, condition 
(1) is obeyed. 

Now consider the other case, where $p(u + v) > 
\max[p(u + w), p(v + w)]$. In this situation, because $f$ 
is concave and equals 0 at 0, it is subadditive on 
non-negative arguments, so $f(p(u + v)) \le f(p(u + w) + p(v + w)) 
\le f(p(u + w)) + f(p(v + w))$. So again 
condition (1) is obeyed. QED. 


B. Proof of claim in Ex. 1 

Consider any $u$, $v$ and $w$ whose norms equal 1. Then 
squaring both sides of condition (1) for our $p$ implies that 

$$\mathrm{Var}\!\left(\tfrac{u + v}{2}\right) \le \mathrm{Var}\!\left(\tfrac{u + w}{2}\right) + \mathrm{Var}\!\left(\tfrac{v + w}{2}\right) + 2\sqrt{\mathrm{Var}\!\left(\tfrac{u + w}{2}\right) \mathrm{Var}\!\left(\tfrac{v + w}{2}\right)}.$$

Use the expansion $\mathrm{Var}\!\left(\tfrac{s + t}{2}\right) = \frac{\mathrm{Var}(s) + \mathrm{Var}(t)}{2} + 
\left(\frac{E(s) - E(t)}{2}\right)^2$ and cancel terms. The hardest case for 
the resultant inequality to hold is where our three variances 
all equal 0. In that case condition 
(1) holds if for any three real numbers $a, b, c$, 

$$|a - b| \le |a - c| + |b - c|.$$

This is just the conventional triangle inequality though. 
So condition (1) always holds. QED. 

C. Proof of claim in Ex. 3 

First note that for any $s \in T(X)$, $\langle s \mid s \rangle \ge 0$, since 
all components of $F$ are non-negative. Furthermore, for 
all $s, t \in T(X)$, $\langle s \mid t \rangle \ge 0$, since all components of those 
vectors are non-negative, as are all components of $F$. In 
addition, we can use the properties of $F$ to prove that our 
tensor contraction obeys the Cauchy-Schwarz inequality: 
$\langle u \mid v \rangle^2 \le \langle u \mid u \rangle \langle v \mid v \rangle \;\forall u, v \in T(X)$. (Expand 
$\langle s \mid s \rangle \ge 0$ for $s = u - av$. Solve for the $a$ minimizing the 
lefthand side (which is quadratic in $a$), and plug that in. 
Collecting terms establishes the desired inequality.) 

Now to check condition (1) for our $p$, square both sides 
of it and cancel terms. So the lefthand side is just $\langle u \mid v \rangle$. 
Since all inner products are non-negative, the right-hand 
side is bounded below by $\sqrt{\langle u \mid u \rangle \langle v \mid v \rangle}$. Plugging in the 
Cauchy-Schwarz inequality establishes that condition (1) 
does indeed hold. QED. 

D. Proof of claim in Ex. 4 

Conditions (2) and (3) are immediate. To prove condition 
(1), first note that it holds for $p_1(s) \equiv \max\{x_1 : 
s(x) \ne 0\}$. Then note that it holds for $p_2(s) \equiv -\min\{x_1 : 
s(x) \ne 0\}$, and invoke Lemma 1.2, to see that the width 
in $x_1$ of the support obeys condition (1). QED. 

E. Proof of claim in Ex. 5 

Conditions (2) and (3) are immediate. Condition (1) 
also holds if (the supports of) u and v overlap, since any 
non-zero volume must equal at least 1, and that overlap 
volume gets counted twice in the sum p(u + w) + p(v + 
w), regardless of w. If (the supports of) u and v do 
not overlap, then (the support of) w must either extend 
outside of (the support) of u or of v. This means that 
condition (1) must hold in this case as well. QED. 

F. Proof of claim in Ex. 6 

Define $\mathrm{argmax}_x u(x) \equiv a$, $\mathrm{argmax}_x v(x) \equiv 
b$, $\max_x (u(x) + v(x)) \equiv M$, and $\max_x (u(x) + w(x)) + 
\max_x (v(x) + w(x)) - 2\|w\| \equiv N$; we want to prove 
that $M \ge N$. To that end, note that if the support 
of $w(x)$ is restricted to $a$ and $b$, then $N$ becomes 
$u(a) + v(b) - \|w\| = u(a) + v(b) - 1 \le u(a)$. On the 
other hand, $M$ is bounded below by $u(a)$. So condition 
(1) holds for this situation. 

We now consider the situation where $w$'s support is 
not restricted to $a$ and $b$. It will be useful to define 
$\mathrm{argmax}_x (u(x) + w(x)) \equiv d$ and $\mathrm{argmax}_x (v(x) + w(x)) \equiv e$. 
First consider the case where $d \ne e$. Then it is immediate 
that by transferring any $w(c \notin \{d, e\})$ to $w(d)$ and/or 
$w(e)$, we do not decrease $N$ (since $\|w\|$ doesn't change). 
We can then transfer $w(d)$ to $w(a)$ and $w(e)$ to $w(b)$, 
again not decreasing $N$. After doing this for all such 
points $c$, we recover the case where the support of $w(x)$ 
is restricted to $a$ and $b$, as in the preceding paragraph. 
So we can conclude that condition (1) is obeyed for this 
case. 

The remaining case to consider is where $d = e$. For this 
case we can transfer all $w(c \ne d)$ to $w(d)$, and in doing so 
increase $N$. Doing this for all such $c$ restricts $w$'s support 
to $d$. After having done this, $\max_x (u(x) + w(x)) = u(d) + 
\|w\| \le u(a) + 1$, and similarly $\max_x (v(x) + w(x)) \le 
v(b) + 1$. So we again get $N \le u(a) + v(b) - 1$, which 
means $M \ge N$. There are no more cases to consider. 
QED. 


G. Proof of claim in Ex. 13 


First note that we have already established that our 
$p$ obeys conditions (2) and (3) of being a multimetric, 
and therefore only need to establish that it is strictly 
concave. That will be the case iff $p(au + (1 - a)v) > 
a\,p(u) + (1 - a)\,p(v) \;\forall u \in T(X), v \in T(X), a \in [0, 1]$. 
Square both sides of this inequality and cancel terms. 
Then exploit the Cauchy-Schwarz inequality for $\langle \cdot \mid \cdot \rangle$, 
established in the proof of the claim in Ex. 3. QED. 